gmantini5011@floridapoly.eduWas browsing through some of the datasets that had been featured in TidyTuesday, and the Nobel Laureate data caught my eye. I had little to no prior knowledge on Nobel Laureate winner data, so I was curious to learn basic information such as the number of winners across different fields/regions. Furthermore, it seemed like a fitting data set that would be able to incorporate relevant visualizations that were asked for. I.e., the spatial visualization being able to be used for showing distribution of winners from around the world. For the scope of this work, I decided to focus on the nobel winners data and not the data that included all of the separate publications from nobel winners.
Load Libraries
library(tidyverse) #mostly for ggplot
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.4 v dplyr 1.0.7
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 2.0.1 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(lubridate) #working with dates/times
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(plotly) #interactive plots
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(sf) #world shapefile
## Linking to GEOS 3.9.1, GDAL 3.2.1, PROJ 7.2.1
Set theme
theme_set(theme_bw())
Read in data, add column to sort by decade for cleaner plots
nobel_winners <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-05-14/nobel_winners.csv") %>% #read_csv is significantly faster here than read.csv
distinct(full_name, prize_year, category, .keep_all = TRUE) %>%
mutate(decade = 10 * (prize_year %/% 10),
age = prize_year - year(birth_date))
Glimpse at Nobel Winners
glimpse(nobel_winners)
## Rows: 911
## Columns: 20
## $ prize_year <dbl> 1901, 1901, 1901, 1901, 1901, 1901, 1902, 1902, 1~
## $ category <chr> "Chemistry", "Literature", "Medicine", "Peace", "~
## $ prize <chr> "The Nobel Prize in Chemistry 1901", "The Nobel P~
## $ motivation <chr> "\"in recognition of the extraordinary services h~
## $ prize_share <chr> "1/1", "1/1", "1/1", "1/2", "1/2", "1/1", "1/1", ~
## $ laureate_id <dbl> 160, 569, 293, 462, 463, 1, 161, 571, 294, 464, 4~
## $ laureate_type <chr> "Individual", "Individual", "Individual", "Indivi~
## $ full_name <chr> "Jacobus Henricus van 't Hoff", "Sully Prudhomme"~
## $ birth_date <date> 1852-08-30, 1839-03-16, 1854-03-15, 1828-05-08, ~
## $ birth_city <chr> "Rotterdam", "Paris", "Hansdorf (Lawice)", "Genev~
## $ birth_country <chr> "Netherlands", "France", "Prussia (Poland)", "Swi~
## $ gender <chr> "Male", "Male", "Male", "Male", "Male", "Male", "~
## $ organization_name <chr> "Berlin University", NA, "Marburg University", NA~
## $ organization_city <chr> "Berlin", NA, "Marburg", NA, NA, "Munich", "Berli~
## $ organization_country <chr> "Germany", NA, "Germany", NA, NA, "Germany", "Ger~
## $ death_date <date> 1911-03-01, 1907-09-07, 1917-03-31, 1910-10-30, ~
## $ death_city <chr> "Berlin", "Châtenay", "Marburg", "Heiden", "Paris~
## $ death_country <chr> "Germany", "France", "Germany", "Switzerland", "F~
## $ decade <dbl> 1900, 1900, 1900, 1900, 1900, 1900, 1900, 1900, 1~
## $ age <dbl> 49, 62, 47, 73, 79, 56, 50, 85, 45, 69, 59, 49, 3~
Linear fit of prize year as a function of age
ggplot(nobel_winners, aes(x = prize_year, y = age)) +
geom_point() +
geom_smooth(method = "lm",
formula = "y ~ x") +
theme_minimal()
Get the jist that prize recipients have been older over time at the time of receiving. Let’s take a closer look:
How has age of Nobel Prize recipients changed over time?
nobel_winners %>%
filter(!is.na(age)) %>%
group_by(decade) %>%
summarize(average_age = mean(age),
median_age = median(age)) %>%
ggplot(aes(decade, average_age)) +
geom_line()
Definitely a noticeable spike since the mid 1900’s.
How about over different categories?
Boxplot view of age per category
nobel_categories <- nobel_winners %>%
mutate(category = fct_reorder(category, age, median, na.rm = TRUE)) %>%
ggplot(aes(category, age)) +
geom_boxplot() +
coord_flip()
ggplotly(nobel_categories)
Get to see a couple interesting outliers here.
nobel_winners %>%
filter(age == 17 | age == 90) %>%
arrange(full_name) %>%
select(full_name,age,prize_year,category)
Change of age over time across categories
nobel_age <- nobel_winners %>%
filter(!is.na(age)) %>%
group_by(decade, category) %>%
summarize(average_age = mean(age),
median_age = median(age)) %>%
ggplot(aes(decade, average_age, color = category)) +
geom_line()
ggplotly(nobel_age)
Average age across the board has increased for the most part, especially for the “hard sciences”. Literature is the oldest, and consistently has been for a while. Peace has by far been the most volatile, notably skewed by a few data points. For the 2010’s, our data is skewed by having the 17 year old Malala Yousafzai win in 2014, in addition for data being only collected up to 2016 for this dataset.
Going back to plotting age as a function of prize_year: Adding categories with trend lines.
ggplot(data = nobel_winners, aes(x = prize_year, y = age))+
geom_point()+
geom_smooth()+
facet_wrap(~ category)
Can confirm that the sciences have taken a noticeable trend upwards, while Peace has actually had a downward trend. Economics has had little change, with Literature having a slight trend upward.
Now taking a look at distribution of birth country for Nobel Prize winners:
nobel_winners %>%
filter(!is.na(birth_country)) %>%
count(birth_country = fct_lump(birth_country, 25),
sort = TRUE) %>%
mutate(birth_country = fct_reorder(birth_country, n)) %>%
ggplot(aes(birth_country, n)) +
geom_col() +
#facet_wrap(~ category) +
coord_flip()
The United States dominates the rest of the field, having well over 3 times the number of recipients as the next highest country (U.K.)
How does birth country look across different categories?
nobel_regions <- nobel_winners %>%
filter(!is.na(birth_country)) %>%
count(birth_country = fct_lump(birth_country, 10),
category,
sort = TRUE) %>%
mutate(birth_country = fct_reorder(birth_country, n)) %>%
ggplot(aes(birth_country, n, fill = category)) +
geom_col() +
facet_wrap(~ category) +
coord_flip()
ggplotly(nobel_regions)
United States is dominant in Chemistry, Medicine, Physics, and Economics. Literature and Peace are the most evenly divided.
How does birth country look across gender?
nobel_winners %>%
filter(!is.na(birth_country)) %>%
count(birth_country = fct_lump(birth_country, 10),
gender,
sort = TRUE) %>%
mutate(birth_country = fct_reorder(birth_country, n)) %>%
ggplot(aes(birth_country, n, fill = gender)) +
geom_col() +
facet_wrap(~ gender) +
coord_flip()
Closer look at female distribution:
nobel_winners %>%
filter(!is.na(birth_country)) %>%
filter(gender == "Female") %>%
count(birth_country = fct_lump(birth_country, 10),
gender,
sort = TRUE) %>%
mutate(birth_country = fct_reorder(birth_country, n)) %>%
ggplot(aes(birth_country, n)) +
geom_col() +
facet_wrap(~ gender) +
coord_flip()
All countries that have had multiple recipients are the United States and Europe, outside of Liberia. Can likely attribute to higher degree of independence and accessibility of education in those areas for women.
Spatial Visualization:
Load world shapefile from Natural Earth
# https://www.naturalearthdata.com/downloads/110m-cultural-vectors/
world_shapes <- read_sf("ne_110m_admin_0_countries/ne_110m_admin_0_countries.shp")
#was having some real problems trying to path to this file when i put it in the data folder, so i cheated a bit and put it in my report folder as well
Rename nobel_winners in temp variable to join on to
nobel_winners_geo <- nobel_winners
names(nobel_winners_geo)[names(nobel_winners_geo) == 'birth_country'] <- 'GEOUNIT'
Join on country
nobel_map <- nobel_winners_geo %>%
left_join(world_shapes, by = "GEOUNIT")
Map showing the average age of individuals in each (birth) country that a person was awarded a Nobel Prize.
# Make a map of Noble Prize winners with ggplot() + geom_sf()
ggplot() +
geom_sf(data = nobel_map, aes(geometry = geometry, fill = age),
color = "white", size = 0.15) +
coord_sf(crs = "+proj=robin") +
scale_fill_gradient2(labels = scales::comma) +
labs(fill = NULL) +
theme_void() +
theme(legend.position = "bottom") +
ggtitle("Average Age of Nobel Prize Winners For Their Respective Birth Country")
As seen above, this correlates to the number of scientists being proportionally higher in these darker regions, and the scientists are getting their rewards at an older age. That being said, the frequency of rewards in certain countries may skew their regions.
Nobel Prizes are predominantly won by men in the United States and Europe, who are in their late 60s. Outside research also shows that the work they won an award for was published ~25 years ago, which was approximately 20% of the way in to their career at that point. This is notably more true for the sciences, with the Peace and Literature Prizes having more variation in age and birth country.
Future possible work:
Could explore other variables such as organizations, death country, affiliation, prize split, and more. Some ideas include doing spatial visualizations with Universities, countries of death, and also doing text mining on the ‘motivation’ and ‘title’. Could also dive in to the partner data set, the nobel_winner_all_pubs, in which analysis could be done on possible variables such as papers before, total papers, position in career, first publication year among others.